Using the tidyverse style guide.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(viridis)
eujingc_315_theme <- theme_bw() +
  theme(axis.text = element_text(size = 12),
        plot.title = element_text(size = 20, face = "bold", hjust = 0),
        plot.subtitle = element_text(size = 14, face = "italic", hjust = 0),
        text = element_text(size = 14, face = "bold", color = "darkslategrey"))
library(MASS)
library(tidyverse)
library(GGally)
data(Cars93)
cont_cols <- which(names(Cars93) %in%
                     c("Price", "MPG.city", "MPG.highway", "EngineSize",
                       "Horsepower", "RPM", "Fuel.tank.capacity", "Passengers",
                       "Length", "Wheelbase", "Width", "Turn.circle", "Weight"))
ggparcoord(Cars93, columns = cont_cols, showPoints = TRUE, alphaLines = 0.3) +
aes(color = factor(Type)) +
labs(color = "Car Type", x = "Variables", y = "Value",
title = "Parallel Coordinates of Continuous Variables for Cars",
subtitle = "Centered and standardized by standard deviation") +
scale_color_viridis(discrete = TRUE) +
eujingc_315_theme +
theme(axis.text.x = element_text(angle = 45))
Each line in the plot corresponds to an observation in our data: the x-axis indexes the continuous variables, and the y-axis gives each observation's standardized score on that variable.
We can see that cars of type 4 tend to have higher mileage than the other types, while cars of type 6 tend to accommodate more passengers.
The parallel coordinates plot is easier to read, since we can follow the line corresponding to each observation across all of the variables.
ggparcoord(Cars93, columns = cont_cols, showPoints = TRUE, alphaLines = 0.3) +
aes(color = factor(Type)) +
coord_polar() +
labs(color = "Car Type", x = "Variables", y = "Value",
title = "Parallel Coordinates of Continuous Variables for Cars",
subtitle = "Centered and standardized by standard deviation") +
scale_color_viridis(discrete = TRUE) +
eujingc_315_theme +
theme(axis.text.x = element_text(angle = 45))
We can instead rescale the variables with scale = "center", which scales each variable by its range and then centers it:
ggparcoord(Cars93, columns = cont_cols, showPoints = TRUE, alphaLines = 0.3,
           scale = "center") +
aes(color = factor(Type)) +
labs(color = "Car Type", x = "Variables", y = "Value",
title = "Parallel Coordinates of Continuous Variables for Cars",
subtitle = "Scaled by range and centered by mean") +
scale_color_viridis(discrete = TRUE) +
eujingc_315_theme +
theme(axis.text.x = element_text(angle = 45))
City mileage is positively correlated with highway mileage: cars with positive scaled values for city mileage tend to have positive scaled values for highway mileage as well.
Highway mileage is negatively correlated with engine size: cars with positive scaled values for highway mileage tend to have negative scaled values for engine size.
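These claims can be checked numerically. A minimal sketch, assuming only that the MASS package (which provides Cars93) is available:

```r
library(MASS)
data(Cars93)

# City vs. highway mileage: should be strongly positive
r_city_hwy <- cor(Cars93$MPG.city, Cars93$MPG.highway)

# Highway mileage vs. engine size: should be negative
r_hwy_engine <- cor(Cars93$MPG.highway, Cars93$EngineSize)

r_city_hwy
r_hwy_engine
```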
cars_cont <- dplyr::select(Cars93, Price, MPG.city, MPG.highway, EngineSize,
Horsepower, RPM, Fuel.tank.capacity, Passengers,
Length, Wheelbase, Width, Turn.circle, Weight)
library(reshape2)
correlation_matrix <- cor(cars_cont)
melted_cormat <- melt(correlation_matrix)
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "darkred", mid = "lightgrey", high = "darkblue") +
labs(title = "Correlation of Variable Pairs",
subtitle = "For continuous variables only",
fill = "Correlation") +
eujingc_315_theme +
theme(axis.text.x = element_text(angle = 90),
axis.title.x = element_blank(), axis.title.y = element_blank())
City and highway mileage are very highly positively correlated, along with fuel tank capacity and weight, and width and weight. Weight is also very highly negatively correlated with city and highway mileage, along with fuel tank capacity and city and highway mileage. There is approximately no correlation between passengers and horsepower, passengers and price, RPM and price, and RPM and horsepower.
This plot is related to a heat map because we map high and low values to a color gradient, and then visualize the values using the mapped colors.
This plot reminds me of a mosaic plot for categorical variables.
correlation_matrix <- cor(cars_cont)
dd <- as.dist((1 - correlation_matrix) / 2)
hc <- hclust(dd)
correlation_matrix <- correlation_matrix[hc$order, hc$order]
correlation_matrix[lower.tri(correlation_matrix)] <- NA
melted_cormat <- melt(correlation_matrix)
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 2)), size = 3) +
scale_fill_gradient2(low = "darkred", mid = "lightgrey", high = "darkblue",
limits = c(-1, 1)) +
labs(title = "Correlation of Variable Pairs",
subtitle = "For continuous variables only",
fill = "Correlation") +
eujingc_315_theme +
theme(axis.text.x = element_text(angle = 90),
axis.title.x = element_blank(), axis.title.y = element_blank())
(20 points)
Variable Dendrograms
Another way to visually explore potential associations between continuous variables in our dataset is with dendrograms.
But we will compute a dendrogram on features instead of data points. In other words, we cluster variables rather than cars.
Compute such a dendrogram for the continuous variables in the Cars93 dataset. To do this: compute the correlation matrix of the variables, transform it with cormat <- 1 - abs(cormat) so that strongly correlated variables are treated as close, convert it to a distance object with the as.dist() function, run hierarchical clustering (hclust()), convert the result to a dendrogram (as.dendrogram()), then plot with ggplot().
library(dendextend)
cormat <- cor(cars_cont)
cormat <- 1 - abs(cormat)
cormat <- as.dist(cormat)
hc <- hclust(cormat)
dg <- as.dendrogram(hc)
# reorder cutree's cluster assignments to match the dendrogram's leaf order
dg <- dg %>% set("labels_col", cutree(hc, k = 4)[labels(dg)])
ggplot(dg)
(5 points) Examine the four-cluster solution. Which variables are in the same cluster? Does it make sense that these are in the same cluster, given both your common-sense understanding of these variables and given the correlation plot you created in Problem 2?
We could use euclidean distances between the mean of each continuous variable.
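One way to sketch this idea: standardize each variable, treat each column as a point in observation space, and take Euclidean distances between variables before clustering. This is an illustrative sketch under those assumptions, not the assignment's required approach:

```r
library(MASS)
data(Cars93)
cars_cont <- Cars93[, c("Price", "MPG.city", "MPG.highway", "EngineSize",
                        "Horsepower", "RPM", "Fuel.tank.capacity", "Passengers",
                        "Length", "Wheelbase", "Width", "Turn.circle", "Weight")]

# Standardize the columns, then transpose so rows correspond to variables;
# dist() then gives pairwise Euclidean distances between variables
var_dist <- dist(t(scale(cars_cont)))
hc_euclid <- hclust(var_dist)
plot(hc_euclid, main = "Variables clustered by Euclidean distance")
```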
If you find your graphic runs over the boundaries, try the standard approaches of adjusting ylim and xlim. Additionally, the knitted file sometimes looks different from the in-editor preview, so knit once before assuming your plot is final.
Use this command
g = barabasi.game(1000,power=1,directed=FALSE)
to create a random graph that is similar to some real social networks.
library(igraph)
g <- barabasi.game(1000, power = 1, directed = FALSE)
plot(g, layout = layout_nicely)
library(ggraph)
ggraph(g, layout = "stress") +
geom_node_point() +
geom_edge_link()
(2 points) Compute and report the key summary statistics of the graph.
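A minimal sketch of common graph summary statistics with igraph (the seed is an assumed choice, used only for reproducibility):

```r
library(igraph)
set.seed(315)  # assumed seed, for reproducibility only
g <- barabasi.game(1000, power = 1, directed = FALSE)

vcount(g)        # number of vertices
ecount(g)        # number of edges
mean(degree(g))  # average degree
diameter(g)      # longest shortest path
edge_density(g)  # fraction of possible edges present
```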
(6 points) Try two different igraph clustering algorithms. Explain what they are doing. Show the resulting plots and say what the modularity is. Which algorithm do you think worked better and why?
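As one possible illustration (any two igraph algorithms qualify; these two are assumptions, not the required choices), here is a sketch using fast-greedy modularity optimization and walktrap:

```r
library(igraph)
set.seed(315)  # assumed seed, for reproducibility only
g <- barabasi.game(1000, power = 1, directed = FALSE)

# Fast greedy: agglomeratively merges communities, greedily increasing modularity
fg <- cluster_fast_greedy(g)
modularity(fg)

# Walktrap: short random walks tend to stay trapped inside densely
# connected communities
wt <- cluster_walktrap(g)
modularity(wt)

# Overlay each community structure on the graph
plot(fg, g, vertex.size = 3, vertex.label = NA)
plot(wt, g, vertex.size = 3, vertex.label = NA)
```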
(2 points) Draw a histogram of betweenness(g). What is the interpretation of this?
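A sketch of the histogram with ggplot2 (theme and bin count are illustrative choices):

```r
library(igraph)
library(ggplot2)
set.seed(315)  # assumed seed, for reproducibility only
g <- barabasi.game(1000, power = 1, directed = FALSE)

btw <- betweenness(g)
ggplot(data.frame(btw = btw), aes(x = btw)) +
  geom_histogram(bins = 30) +
  labs(x = "Betweenness centrality", y = "Count",
       title = "Betweenness of Nodes")
# Betweenness counts the shortest paths passing through a node; a heavily
# right-skewed histogram indicates a few hub nodes carry most of the traffic
```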
data(iris)
X <- log(iris[, 1:4])
out <- prcomp(X, center = TRUE, scale. = TRUE)
biplot(out)
ggplot(as.data.frame(out$x[, 1:2]),
aes(x = PC1, y = PC2, color = iris$Species)) +
geom_point() +
labs(x = "PC1", y = "PC2", color = "Species",
title = "2D Projection of Iris Data",
subtitle = "Using First 2 Principal Components") +
scale_color_viridis(discrete = TRUE) +
eujingc_315_theme
library(kernlab)
outk <- kpca(as.matrix(X))
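To visualize the kernel PCA projection, one possible sketch (the rbfdot kernel and sigma value mirror kernlab's defaults and are assumptions, not specified by the assignment):

```r
library(kernlab)
data(iris)
X <- log(iris[, 1:4])

# Kernel PCA with an RBF kernel; sigma = 0.1 is kernlab's default-style choice
outk <- kpca(as.matrix(X), kernel = "rbfdot", kpar = list(sigma = 0.1))
kpc <- rotated(outk)[, 1:2]  # data projected onto the first two kernel PCs
plot(kpc, col = as.integer(iris$Species),
     xlab = "Kernel PC1", ylab = "Kernel PC2",
     main = "Kernel PCA Projection of Iris")
```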
(2 points) Now repeat the analysis using tsne. Try perplexity = 30 and perplexity = 5. What happens as you change the perplexity?
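A sketch with the Rtsne package (an assumed implementation choice; note that Rtsne requires duplicate rows to be dropped first):

```r
library(Rtsne)
data(iris)
X <- unique(log(iris[, 1:4]))  # Rtsne disallows duplicate rows

set.seed(315)  # t-SNE is stochastic; fix the seed to compare runs
ts30 <- Rtsne(X, perplexity = 30)
set.seed(315)
ts5 <- Rtsne(X, perplexity = 5)

plot(ts30$Y, main = "t-SNE, perplexity = 30")
plot(ts5$Y, main = "t-SNE, perplexity = 5")
# Smaller perplexity weights local neighborhoods more heavily, often
# fragmenting clusters; larger perplexity preserves more global structure
```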
(2 points) Load the mlbench library and use this command:
X = 4*mlbench.spirals(100,1,.025)$x
Plot the data.
Then use
sc = specc(X, centers=2)
from kernlab to perform spectral clustering. Plot the data and use two colors to indicate the result of the spectral clustering. Now repeat using k-means clustering. Why did the spectral clustering work better?
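A sketch of the comparison (the seed is an assumed choice for reproducibility):

```r
library(mlbench)
library(kernlab)
set.seed(315)  # assumed seed
X <- 4 * mlbench.spirals(100, 1, 0.025)$x

sc <- specc(X, centers = 2)   # spectral clustering via a kernel similarity graph
km <- kmeans(X, centers = 2)  # k-means on raw Euclidean distances

par(mfrow = c(1, 2))
plot(X, col = sc, main = "Spectral clustering")
plot(X, col = km$cluster, main = "k-means")
# k-means partitions the plane with a linear boundary between centroids, so it
# cuts across the spirals; spectral clustering follows the connectivity of the
# similarity graph and can separate the intertwined spirals
```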
(1 point each)
Arc Pie Charts
Install and load the ggforce package. This package implements several updates and improvements to ggplot2.
Use geom_arc_bar() to create an arc pie chart of the Type variable in the Cars93 dataset. (Code provided.)
library(ggforce)
Cars93 %>% group_by(Type) %>%
summarize(count = n()) %>%
mutate(max = max(count),
focus_var = 0.2 * (count == max(count))) %>%
ggplot() + geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0.8, r = 1,
fill = Type, amount = count),
stat = 'pie')
Adjust the r0 parameter to lower and higher values. What does this control? What is the minimum and maximum value?
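To see the effect, a sketch varying r0 while holding r = 1 fixed (the particular r0 values are chosen for illustration):

```r
library(MASS)
library(dplyr)
library(ggplot2)
library(ggforce)
data(Cars93)

counts <- Cars93 %>% group_by(Type) %>% summarize(count = n())

# r0 = 0: the inner radius collapses, giving a solid pie chart
ggplot(counts) +
  geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0, r = 1, fill = Type, amount = count),
               stat = "pie")

# r0 close to r: only a thin ring of each wedge remains
ggplot(counts) +
  geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0.95, r = 1, fill = Type,
                   amount = count),
               stat = "pie")
```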
Recreate the graph from (a), but this time, add explode = focus_var into your call to aes(). What does this do?
Recreate the graph from (c), but this time, add focus to the category with the minimum number of observations.
Critique these graphs.
(5 points each)
Zoom
See the following code working with the IMDb movies dataset for how to use facet_zoom().
library(tidyverse)
library(forcats)
library(devtools)
library(ggforce)
# Colorblind-friendly color palette
my_colors <- c("#000000", "#56B4E9", "#E69F00", "#F0E442", "#009E73", "#0072B2",
               "#D55E00", "#CC79A7")
# Read in the data
imdb <- read_csv("https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/imdb_test.csv")
# get some more variables
imdb <- mutate(imdb, profit = (gross - budget) / 1000000,
is_french = ifelse(country == "France", "Yes", "No")) %>%
filter(movie_title != "The Messenger: The Story of Joan of Arc")
france_1990 <- filter(imdb, country == "France", title_year >= 1990)
# this code plots a scatterplot + a zoomed facet
ggplot(data = imdb, aes(x = title_year, y = profit)) +
geom_point(color = my_colors[1], alpha = 0.25) +
geom_smooth(color = my_colors[2]) +
geom_point(data = france_1990, color = my_colors[3]) +
geom_smooth(data = france_1990, aes(x = title_year, y = profit),
color = my_colors[4], method = lm) +
facet_zoom(x = title_year >= 1990) +
labs(title = "Movie Profits over Time",
subtitle = "Zoom: French Movies from 1990 -- 2017 (orange/yellow)",
caption = "Data from IMDB and Kaggle",
x = "Year of Release",
y = "Profit (millions of USD)")
Also read the articles here, or here.
Recreate any scatterplot that we created throughout the year, and zoom in on a section of the graph via the facet_zoom() feature in the newest version of the ggforce package. Include a title, subtitle, and caption in the resulting graph. The caption should just state the data source, and the subtitle should explain what area of the plot is being enhanced via zooming.
Interpret the resulting graph: Describe some feature of the new version of the graph that you may not have been able to see very well in the previous version of the same graph (without zooming).